Floods are among the most destructive natural disasters. In the United States, natural disaster recovery is managed by the Federal Emergency Management Agency (FEMA). One interesting aspect of floods is that they can occur almost anywhere. Assemble a dataset to investigate floods in 2020-2021, using the data resources below. Clean and organize the data, then write an EDA report.
Questions to be answered: How dangerous are floods? How expensive? Is there any pattern to the kinds of communities that suffer losses from floods?
There are three data sources in total: datasets from NOAA, datasets from FEMA, and census data.
NOAA: https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/ FEMA1: https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1 FEMA2: https://www.fema.gov/openfema-data-page/disaster-declarations-summaries-v2 Census Data: https://www.census.gov (somewhere inside)
This report focuses on data from 2020 to 2021.
# Import Datasets from NOAA
detail2020 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_details-ftp_v1.0_d2020_c20230927.csv")
detail2021 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_details-ftp_v1.0_d2021_c20231017.csv")
location2020 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_locations-ftp_v1.0_d2020_c20231017.csv")
location2021 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_locations-ftp_v1.0_d2021_c20231017.csv")
fata2020 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_fatalities-ftp_v1.0_d2020_c20230927.csv")
fata2021 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_fatalities-ftp_v1.0_d2021_c20231017.csv")
The detail, location, and fatality datasets share common key columns, so they can be merged into a single dataset. This project focuses only on "FLOOD" as a disaster type. There should be four types of floods in total: "Flood", "Lakeshore Flood", "Flash Flood", and "Coastal Flood".
# Validate keys
if (any(duplicated(detail2020$EVENT_ID)) || any(duplicated(detail2021$EVENT_ID))) {
  stop("Duplicate EVENT_IDs found in detail datasets.")
}
# Pre-process datasets (if necessary, based on your analysis needs)
# Merge in steps
combinedDetails <- rbind(detail2020, detail2021)
combinedLocations <- rbind(location2020, location2021)
combinedFatalities <- rbind(fata2020, fata2021)
combined <- merge(combinedDetails, combinedLocations, by = c("EVENT_ID", "EPISODE_ID"), all = TRUE)
combined <- merge(combined, combinedFatalities, by = "EVENT_ID", all.x = TRUE)
# Post-process merged dataset
combined <- combined[!duplicated(combined$EVENT_ID), ]
# Consistency check (based on your analysis needs)
# Define the types of floods to filter
floodTypes <- c("Flood", "Coastal Flood", "Flash Flood", "Lakeshore Flood")
# Filter for Flood events
combined <- combined[combined$EVENT_TYPE %in% floodTypes, ]
# Get unique incidents of flood and print them
unique_incidents <- unique(combined$EVENT_TYPE)
print(unique_incidents)
## [1] "Flood" "Lakeshore Flood" "Flash Flood" "Coastal Flood"
The "combined" dataset is my raw dataset at this point. In this step I remove columns that are not helpful for EDA and keep only the columns I am interested in.
#colnames(combined)
combined <- combined[, -which(names(combined) %in% c("WFO", "CZ_TIMEZONE", "SOURCE","TOR_WIDTH", "TOR_LENGTH", "TOR_OTHER_CZ_FIPS", "TOR_OTHER_CZ_NAME", "TOR_OTHER_CZ_STATE", "TOR_F_SCALE","CZ_TYPE","CZ_FIPS","CZ_NAME","EVENT_NARRATIVE","EPISODE_NARRATIVE","END_AZIMUTH","END_RANGE","BEGIN_RANGE","BEGIN_AZIMUTH","LAT2","LON2","RANGE","AZIMUTH","LOCATION_INDEX","STATE_FLIPS"))]
colnames(combined)
## [1] "EVENT_ID" "EPISODE_ID" "BEGIN_YEARMONTH"
## [4] "BEGIN_DAY" "BEGIN_TIME" "END_YEARMONTH"
## [7] "END_DAY" "END_TIME" "STATE"
## [10] "STATE_FIPS" "YEAR" "MONTH_NAME"
## [13] "EVENT_TYPE" "BEGIN_DATE_TIME" "END_DATE_TIME"
## [16] "INJURIES_DIRECT" "INJURIES_INDIRECT" "DEATHS_DIRECT"
## [19] "DEATHS_INDIRECT" "DAMAGE_PROPERTY" "DAMAGE_CROPS"
## [22] "MAGNITUDE" "MAGNITUDE_TYPE" "FLOOD_CAUSE"
## [25] "CATEGORY" "TOR_OTHER_WFO" "BEGIN_LOCATION"
## [28] "END_LOCATION" "BEGIN_LAT" "BEGIN_LON"
## [31] "END_LAT" "END_LON" "DATA_SOURCE"
## [34] "YEARMONTH" "LOCATION" "LATITUDE"
## [37] "LONGITUDE" "FAT_YEARMONTH" "FAT_DAY"
## [40] "FAT_TIME" "FATALITY_ID" "FATALITY_TYPE"
## [43] "FATALITY_DATE" "FATALITY_AGE" "FATALITY_SEX"
## [46] "FATALITY_LOCATION" "EVENT_YEARMONTH"
I am going to use some of these columns for later analysis. Note that I convert the incident begin date to a Date here (where an incident corresponds to any flood event).
dangerousflood <- combined[, c("EVENT_ID", "EPISODE_ID", "STATE", "YEAR", "MONTH_NAME", "EVENT_TYPE","INJURIES_DIRECT", "INJURIES_INDIRECT", "DEATHS_DIRECT", "DEATHS_INDIRECT", "DAMAGE_PROPERTY", "DAMAGE_CROPS", "FLOOD_CAUSE", "CATEGORY","BEGIN_LOCATION", "END_LOCATION", "LATITUDE", "LONGITUDE","FATALITY_AGE", "FATALITY_SEX","FATALITY_LOCATION","BEGIN_DATE_TIME","MAGNITUDE_TYPE","MAGNITUDE")]
# Convert BEGIN_DATE_TIME to Date format
dangerousflood$BEGIN_DATE <- as.Date(dangerousflood$BEGIN_DATE_TIME, format = "%d-%b-%y %H:%M:%S")
# Check the first few dates to ensure they were converted correctly
head(dangerousflood$BEGIN_DATE)
## [1] "2020-01-01" "2020-01-01" "2020-01-01" "2020-01-11" "2020-01-11"
## [6] "2020-01-11"
Well, I don't find this dataset particularly useful for now.
#census 5225
meta2020_1_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2020.S1701-Column-Metadata.csv")
meta2020_2_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2020.S1701-Data.csv", skip = 1)
meta2021_1_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2021.S1701-Column-Metadata.csv")
meta2021_2_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2021.S1701-Data.csv")
#census 0133
meta2020_1_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2020.DP05-Column-Metadata.csv")
meta2020_2_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2020.DP05-Data.csv")
meta2021_1_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2021.DP05-Column-Metadata.csv")
meta2021_2_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2021.DP05-Data.csv")
#census 0147
meta2020_1_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2020.B25001-Column-Metadata.csv")
meta2020_2_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2020.B25001-Data.csv")
meta2021_1_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2021.B25001-Column-Metadata.csv")
meta2021_2_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2021.B25001-Data.csv")
I found the API endpoints on the FEMA website and downloaded both FEMA datasets directly through the API.
#Import Datasets from FEMA v1
data_url <- "https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.json"
# Making the GET request to the data URL
response <- GET(url = data_url)
# Check the response status code
print(status_code(response))
## [1] 200
# Parse the content of the response into a dataframe
data <- fromJSON(rawToChar(response$content))
# Convert the data to a dataframe
floodv1 <- as.data.frame(data$FemaWebDisasterSummaries)
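One caveat worth flagging: the OpenFEMA API pages its results, so a single GET may not return the full dataset. Below is a hedged sketch of page-by-page retrieval, assuming OpenFEMA's OData-style `$skip`/`$top` parameters (the 1,000-record page size is my understanding of the default and worth re-checking against the API docs).

```r
library(httr)
library(jsonlite)

# Fetch one page of an OpenFEMA endpoint; loop over `skip` in steps of
# `top` until a page comes back empty to retrieve everything.
fetch_page <- function(base_url, skip = 0, top = 1000) {
  resp <- GET(base_url, query = list("$skip" = skip, "$top" = top))
  stopifnot(status_code(resp) == 200)
  fromJSON(rawToChar(resp$content))
}
```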
Time for the other FEMA dataset. However, the sample size is small, so I will set it aside for now.
# Import Datasets from FEMA v2
# Correctly encode the URL
data_url <- "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.json?$filter=incidentType%20eq%20%27Flood%27%20and%20(incidentBeginDate%20ge%20%272020-01-01T00:00:00.000Z%27%20and%20incidentBeginDate%20le%20%272021-12-31T23:59:59.000Z%27)"
# Making the GET request to the data URL
response <- GET(url = data_url)
# Check the response status code
print(status_code(response))
## [1] 200
# Parse the content of the response into a dataframe
data <- fromJSON(rawToChar(response$content))
# Convert the data to a dataframe
floodv2 <- as.data.frame(data$DisasterDeclarationsSummaries)
# Check the unique values in the incidentType column
unique_incidents <- unique(floodv2$incidentType)
# Print the unique incident types
print(unique_incidents)
## [1] "Flood"
I tried to merge them into one dataset.
combinedFEMA <- floodv1 %>%
inner_join(floodv2, by = "disasterNumber")
I decided to use NOAA for my report as there are many insightful columns for analysis.
# glimpse
glimpse(dangerousflood)
## Rows: 13,661
## Columns: 25
## $ EVENT_ID <int> 863537, 863585, 863586, 864250, 864251, 864461, 8644…
## $ EPISODE_ID <int> 143838, 143845, 143845, 143983, 143983, 144018, 1440…
## $ STATE <chr> "WISCONSIN", "WISCONSIN", "WISCONSIN", "WISCONSIN", …
## $ YEAR <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
## $ MONTH_NAME <chr> "January", "January", "January", "January", "January…
## $ EVENT_TYPE <chr> "Flood", "Flood", "Flood", "Lakeshore Flood", "Lakes…
## $ INJURIES_DIRECT <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ INJURIES_INDIRECT <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DEATHS_DIRECT <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DEATHS_INDIRECT <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DAMAGE_PROPERTY <chr> "96.00K", "0.00K", "0.00K", "10.70M", "0.00K", "50.0…
## $ DAMAGE_CROPS <chr> "0.00K", "0.00K", "0.00K", "0.00K", "0.00K", "0.00K"…
## $ FLOOD_CAUSE <chr> "Ice Jam", "Heavy Rain / Snow Melt", "Heavy Rain / S…
## $ CATEGORY <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BEGIN_LOCATION <chr> "OCONTO", "KEWAUNEE", "OCONTO", "", "", "MATNEY", "S…
## $ END_LOCATION <chr> "OCONTO", "KEWAUNEE", "OCONTO", "", "", "KELLERSVILL…
## $ LATITUDE <dbl> 44.8939, 44.4607, 44.9116, NA, NA, 36.2495, 36.5023,…
## $ LONGITUDE <dbl> -87.8492, -87.5019, -87.8422, NA, NA, -81.8980, -81.…
## $ FATALITY_AGE <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ FATALITY_SEX <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ FATALITY_LOCATION <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BEGIN_DATE_TIME <chr> "01-JAN-20 00:00:00", "01-JAN-20 00:00:00", "01-JAN-…
## $ MAGNITUDE_TYPE <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", …
## $ MAGNITUDE <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BEGIN_DATE <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-11, 202…
I would like to know which state has the most flood events, the top 5 states with the highest flood frequency, and the combined property and crop damage for each state.
# Check all states
state_all <- dangerousflood |> group_by(STATE) |> count()
if (sum(state_all$n) == nrow(dangerousflood)) {
  print("Every row has value in the State column.")
}
## [1] "Every row has value in the State column."
print(state_all)
## # A tibble: 55 × 2
## # Groups: STATE [55]
## STATE n
## <chr> <int>
## 1 ALABAMA 336
## 2 ALASKA 25
## 3 AMERICAN SAMOA 11
## 4 ARIZONA 454
## 5 ARKANSAS 460
## 6 CALIFORNIA 367
## 7 COLORADO 147
## 8 CONNECTICUT 86
## 9 DELAWARE 38
## 10 DISTRICT OF COLUMBIA 29
## # ℹ 45 more rows
# State with the highest flood count
state_max <- state_all$STATE[which(state_all$n == max(state_all$n) )]
print(state_max)
## [1] "VIRGINIA"
# Group by STATE and count the number of occurrences
state_all <- dangerousflood %>%
group_by(STATE) %>%
summarise(Count= n())
# Top counts of states
top_states <- state_all %>%
top_n(5, Count)
print(top_states)
## # A tibble: 5 × 2
## STATE Count
## <chr> <int>
## 1 KENTUCKY 696
## 2 MISSOURI 797
## 3 NEW YORK 611
## 4 TEXAS 751
## 5 VIRGINIA 1040
states_map <- map_data("state")
It appears that flood occurrences are not evenly distributed across the United States, with some states experiencing a higher frequency of events. Notably, Virginia stands out with 1,040 reported flood incidents across 2020 and 2021. On average, that is more than one reported flood event per day over the two years, and this high frequency highlights a significant and persistent challenge for the state government and its residents.
Missouri had 797 events in total. Kentucky, Texas, and New York also recorded 696, 751, and 611 occurrences respectively.
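The per-day figure quoted above can be checked directly (2020 is a leap year, so 2020-2021 spans 731 days):

```r
# Average reported flood events per day in Virginia over 2020-2021
va_events <- 1040
days <- 366 + 365            # 2020 (leap year) + 2021
round(va_events / days, 2)   # about 1.42 events per day
```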
Now it's time to visualize the financial and agricultural damage from floods.
# Function to convert damage values to numeric
convert_damage_to_numeric <- function(damage) {
if (is.na(damage) || damage == "") {
return(NA)
}
# Remove commas, convert to upper case, and trim leading/trailing whitespace
damage <- gsub(",", "", damage)
damage <- toupper(trimws(as.character(damage)))
# Convert based on suffixes
multiplier <- 1
if (grepl("K$", damage)) {
multiplier <- 1e3
damage <- gsub("K$", "", damage)
} else if (grepl("M$", damage)) {
multiplier <- 1e6
damage <- gsub("M$", "", damage)
} else if (grepl("B$", damage)) {
multiplier <- 1e9
damage <- gsub("B$", "", damage)
}
# Convert to numeric and multiply by the appropriate factor
numeric_damage <- as.numeric(damage)
if (!is.na(numeric_damage)) {
return(numeric_damage * multiplier)
} else {
return(NA)
}
}
# Apply the function to the DAMAGE_PROPERTY and DAMAGE_CROPS columns
dangerousflood$DAMAGE_PROPERTY_NUM <- sapply(dangerousflood$DAMAGE_PROPERTY, convert_damage_to_numeric)
dangerousflood$DAMAGE_CROPS_NUM <- sapply(dangerousflood$DAMAGE_CROPS, convert_damage_to_numeric)
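A quick self-contained check of the suffix logic, using values visible in the glimpse above ("96.00K", "10.70M"). The compact converter below is an illustrative re-implementation, not the project's function:

```r
# Minimal sketch of the K/M/B suffix conversion on sample values
convert_one <- function(x) {
  if (is.na(x) || x == "") return(NA_real_)
  x <- toupper(trimws(x))
  # Look up the multiplier by the last character; plain numbers get 1
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[substring(x, nchar(x))]
  if (is.na(mult)) mult <- 1
  as.numeric(gsub("[KMB]$", "", x)) * mult
}
sapply(c("96.00K", "10.70M", "0.00K"), convert_one)
# 96.00K -> 96000, 10.70M -> 10700000, 0.00K -> 0
```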
# Now, summarize the total damage by state using the new numeric columns
state_damage <- dangerousflood %>%
group_by(STATE) %>%
summarise(Total_Damage_Property = sum(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
Total_Damage_Crops = sum(DAMAGE_CROPS_NUM, na.rm = TRUE))
# Check the summary data frame
print(state_damage)
## # A tibble: 55 × 3
## STATE Total_Damage_Property Total_Damage_Crops
## <chr> <dbl> <dbl>
## 1 ALABAMA 4985000 0
## 2 ALASKA 15429740 0
## 3 AMERICAN SAMOA 701000 81000
## 4 ARIZONA 87238500 0
## 5 ARKANSAS 1548000 0
## 6 CALIFORNIA 22385000 18000
## 7 COLORADO 2106710 425240
## 8 CONNECTICUT 7110510 0
## 9 DELAWARE 3600000 0
## 10 DISTRICT OF COLUMBIA 100000 0
## # ℹ 45 more rows
# Plot total property damage by state
ggplot(state_damage, aes(x = reorder(STATE, -Total_Damage_Property), y = Total_Damage_Property)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Total Property Damage by State",
x = "State",
y = "Total Property Damage (USD)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The data indicate that New York incurred the highest total property damage from floods (over $15 million), followed by Louisiana, Michigan, and Pennsylvania. This finding is consistent with the high frequency of flooding events reported in New York. These states lie mostly on the eastern side of the country, suggesting a higher flood impact in the East.
Virginia has lower total property damage than the states above, and interestingly, Missouri, despite its high number of flood incidents, does not show similarly high property damage. That might suggest effective flood management by the state government or some locational advantages that mitigate the cost.
# Function to convert character damage values to numeric
convert_damage <- function(damage) {
if (is.na(damage)) {
return(NA_real_)
}
multiplier <- ifelse(grepl("K", damage), 1e3, ifelse(grepl("M", damage), 1e6, 1))
as.numeric(gsub("[^0-9.]", "", damage)) * multiplier
}
# Apply the function to the DAMAGE_PROPERTY column
dangerousflood <- dangerousflood %>%
mutate(DAMAGE_PROPERTY = sapply(DAMAGE_PROPERTY, convert_damage))
# Now let's try the summarise again
state_damage_totals <- dangerousflood %>%
group_by(STATE) %>%
summarise(Total_Damage_Property = sum(DAMAGE_PROPERTY, na.rm = TRUE)) %>%
mutate(STATE = tolower(STATE)) # Ensure state names match map data
# Merge map data with total property damages
map_data_merged <- merge(states_map, state_damage_totals, by.x = "region", by.y = "STATE", all.x = TRUE)
# Plot with more visible colors
ggplot() +
geom_polygon(data = map_data_merged, aes(x = long, y = lat, group = group, fill = Total_Damage_Property), color = "white") +
scale_fill_continuous(name = "Total Property Damage (USD)",
low = "lightblue1", high = "darkblue",
na.value = "grey50", # Grey color for states with NA values
labels = scales::comma) + # Format labels with commas for readability
labs(title = "Total Property Damage by State") +
theme_void() +
theme(legend.position = "right")
The distribution of flood-related property damage across the United States reveals a distinct pattern, with the eastern regions sustaining the most significant financial impact.
The Midwestern and Northern states report relatively minimal damage, which might reflect less frequent flood events. Conversely, the Southern states experience severe property damage due to flooding. This severity could be due to a combination of factors, including the prevalence of hurricanes and tropical storms that bring intense rainfall. Coastal areas also show relatively high property damage.
# Log transform of total crop damage
ggplot(state_damage, aes(x = reorder(STATE, -Total_Damage_Crops), y = log(Total_Damage_Crops + 1))) +
geom_bar(stat = "identity", fill = "cyan4") +
labs(title = "Log of Total Crop Damage by State",
x = "State",
y = "Log of Total Crop Damage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The distribution of flood-related crop damage across states exhibits a pronounced right skew, so I apply a log transformation to make the plot easier to interpret.
However, it is important to note that crop damage data are absent for many states, and I could not obtain those values; the NOAA datasets contain 'NA' values that deserve attention. Despite this limitation, the available data show that states such as the Dakotas, Minnesota, and Iowa sustained the most severe crop damage. This pattern may reflect these states' considerable agricultural activity and the resulting losses of agricultural land to flooding.
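To make that limitation concrete, the share of events with missing or zero crop-damage values can be tabulated per state. A minimal sketch on toy data, with column names assumed to match the cleaned dataset:

```r
library(dplyr)

# Toy rows standing in for the cleaned dataset
toy <- data.frame(
  STATE = c("IOWA", "IOWA", "TEXAS", "TEXAS"),
  DAMAGE_CROPS_NUM = c(5000, NA, 0, 12000)
)

# Share of events per state with no usable crop-damage figure
toy %>%
  group_by(STATE) %>%
  summarise(share_no_data = mean(is.na(DAMAGE_CROPS_NUM) | DAMAGE_CROPS_NUM == 0))
```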
# Ensure the states_map data is loaded
states_map <- map_data("state")
# Apply the function to the DAMAGE_CROPS column
dangerousflood <- dangerousflood %>%
mutate(DAMAGE_CROPS = sapply(DAMAGE_CROPS, convert_damage))
# Now let's summarize the crop damages by state
state_crop_damage_totals <- dangerousflood %>%
group_by(STATE) %>%
summarise(Total_Damage_Crops = sum(DAMAGE_CROPS, na.rm = TRUE)) %>%
mutate(STATE = tolower(STATE)) # Ensure state names match map data
# Merge the states_map with the summarized crop damages
map_data_merged_crops <- merge(states_map, state_crop_damage_totals, by.x = "region", by.y = "STATE", all.x = TRUE)
# Now create the plot with the merged data
ggplot() +
geom_polygon(data = map_data_merged_crops, aes(x = long, y = lat, group = group, fill = Total_Damage_Crops), color = "white") +
scale_fill_continuous(name = "Total Crop Damage",
trans = "log1p", # Log transformation to better visualize the range
low = "lightgreen", high = "darkgreen",
na.value = "grey50", # Grey color for states with NA values
breaks = scales::trans_breaks("log1p", function(x) 10^x), # Adjust breaks for log scale
labels = scales::trans_format("log1p", scales::math_format(10^.x))) + # Adjust labels for log scale
labs(title = "Total Crop Damage by State") +
theme_void() +
theme(legend.position = "right",
legend.key.width = unit(1.5, "cm"), # Adjust legend key width
legend.title = element_text(size = 10), # Adjust legend title size
legend.text = element_text(size = 8)) # Adjust legend text size
Visual analysis of the crop damage data illuminates a notable concentration of severe impacts in the northern part of the United States. States like the Dakotas, Minnesota, and Iowa, which are characterized by extensive agricultural activity, appear to bear the brunt of the damage. These regions, often referred to as the nation's breadbasket, may be particularly susceptible due to their reliance on large-scale farming and the potential for flooding to disrupt vast swathes of cropland.
In the South, Texas stands out with significant crop damage, possibly attributable to its size, diverse climate zones, and exposure to extreme weather events such as hurricanes, which can bring about widespread flooding. The damage in these southern areas, though less concentrated than in the North, still represents a substantial economic impact, highlighting the vulnerability of agricultural sectors to flooding across varied geographical landscapes.
Investigating flood-related injuries and fatalities and their distribution among the states could provide insight into the human cost of these natural disasters and into how dangerous floods are.
# Summarize death counts by state
death_counts_by_state <- dangerousflood %>%
group_by(STATE) %>%
summarise(Total_Deaths = sum(DEATHS_DIRECT + DEATHS_INDIRECT))
# Summarize injury counts by state
injury_counts_by_state <- dangerousflood %>%
group_by(STATE) %>%
summarise(Total_Injuries = sum(INJURIES_DIRECT + INJURIES_INDIRECT))
# Base plot with death counts in green
p_combined <- plot_ly(death_counts_by_state, x = ~reorder(STATE, -Total_Deaths), y = ~Total_Deaths, type = 'bar', marker = list(color = 'green'), name = 'Deaths') %>%
layout(title = "Total Counts Due to Floods by State",
xaxis = list(title = "State", tickangle = 45),
yaxis = list(title = "Total Counts"))
# Add injury counts to the same plot in blue
p_combined <- p_combined %>%
add_trace(data = injury_counts_by_state, x = ~reorder(STATE, -Total_Injuries), y = ~Total_Injuries, type = 'bar', marker = list(color = 'blue'), name = 'Injuries')
# Render the combined plot
p_combined
The data indicate that Tennessee and New York experienced the highest numbers of deaths and injuries due to flooding, suggesting these states are particularly vulnerable to the more perilous consequences of such events. Texas and Arizona also report significant numbers, probably owing to their geography.
# Convert BEGIN_DATE_TIME to Date format
dangerousflood$BEGIN_DATE <- as.Date(dangerousflood$BEGIN_DATE_TIME, format = "%d-%b-%y %H:%M:%S")
# Aggregate injuries and deaths by date
injuries_deaths_by_date <- dangerousflood %>%
group_by(BEGIN_DATE) %>%
summarise(Total_Injuries = sum(INJURIES_DIRECT + INJURIES_INDIRECT),
Total_Deaths = sum(DEATHS_DIRECT + DEATHS_INDIRECT))
# Time Series Line Plot of Total Injuries Over Time
p_injuries <- plot_ly(data = injuries_deaths_by_date, x = ~BEGIN_DATE, y = ~Total_Injuries, type = 'scatter', mode = 'lines', line = list(color = 'cyan4'), name = 'Injuries') %>%
layout(title = "Time Series of Total Injuries & Deaths Over Time",
xaxis = list(title = "Date"),
yaxis = list(title = "Total Injuries"))
# Time Series Line Plot of Total Deaths Over Time
p_deaths <- plot_ly(data = injuries_deaths_by_date, x = ~BEGIN_DATE, y = ~Total_Deaths, type = 'scatter', mode = 'lines', line = list(color = 'red'), name = 'Deaths') %>%
layout(title = "Time Series of Total Injuries & Deaths Over Time",
xaxis = list(title = "Date"),
yaxis = list(title = "Total Deaths"))
# Arrange the plots in a grid
subplot(p_injuries, p_deaths, nrows = 2, shareX = TRUE, titleX = FALSE)
What I did here is plot a time series of how injuries and deaths occurred over this timeframe. It turns out that injuries spike between July 2021 and October 2021, a period that coincides with some of the most devastating floods. Deaths, by contrast, occur fairly consistently throughout the two years, underscoring the heightened risk of serious harm during such events. Thus, we cannot conclude that flood harm is strictly seasonal. This highlights the critical importance of effective flood management and emergency response measures all year round, particularly during periods of high risk.
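The reading above rests on aggregating casualties over time; a self-contained sketch of a monthly version of that step on toy data (column names match the cleaned dataset):

```r
library(dplyr)

# Toy rows standing in for the cleaned dataset
toy <- data.frame(
  BEGIN_DATE      = as.Date(c("2021-07-15", "2021-08-21", "2021-08-30", "2020-03-10")),
  DEATHS_DIRECT   = c(2, 5, 1, 0),
  DEATHS_INDIRECT = c(0, 1, 0, 1)
)

# Total deaths per calendar month
toy %>%
  mutate(month = format(BEGIN_DATE, "%Y-%m")) %>%
  group_by(month) %>%
  summarise(deaths = sum(DEATHS_DIRECT + DEATHS_INDIRECT))
```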
The dataset contains four flood types, and I am intrigued by the potential connections between flood type and cause. Thus, I built a contingency table showing the frequency of each cause within each flood type, which makes it possible to identify patterns or trends in the data.
# Contingency table of flood cause by event type
table(dangerousflood$FLOOD_CAUSE, dangerousflood$EVENT_TYPE)
##
## Coastal Flood Flash Flood Flood Lakeshore Flood
## 404 0 0 156
## Dam / Levee Break 0 12 5 0
## Heavy Rain 0 7918 4455 0
## Heavy Rain / Burn Area 0 136 3 0
## Heavy Rain / Snow Melt 0 33 215 0
## Heavy Rain / Tropical System 0 266 27 0
## Ice Jam 0 1 26 0
## Planned Dam Release 0 0 4 0
The contingency table reveals a distinct distribution of causes across different flood types, with ‘Heavy Rain’ being the predominant cause for both ‘Flash Flood’ and the more general category labeled as ‘Flood.’ However, for ‘Lakeshore Flood’ and ‘Coastal Flood,’ the dataset does not provide information on the causes, leaving a gap in our understanding of the factors that contribute to these specific types of flooding. This absence of data prevents us from drawing concrete conclusions about what triggers ‘Lakeshore Flood’ and ‘Coastal Flood’ events, highlighting an area where further data collection and research are necessary to complete the picture of flood causality.
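That gap can be quantified directly as the share of events per type with a blank FLOOD_CAUSE. A sketch on toy rows mirroring the table above:

```r
# Toy rows mirroring the contingency table's pattern of blank causes
toy <- data.frame(
  EVENT_TYPE  = c("Coastal Flood", "Flash Flood", "Flood", "Lakeshore Flood", "Flood"),
  FLOOD_CAUSE = c("", "Heavy Rain", "Heavy Rain", "", "Ice Jam")
)
# Proportion of blank causes within each event type
tapply(toy$FLOOD_CAUSE == "", toy$EVENT_TYPE, mean)
# Coastal Flood and Lakeshore Flood come out as 1 (all causes blank)
```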
Thus, I created a new subset called 'coastal_lakeshore_floods', which contains only 'Lakeshore Flood' and 'Coastal Flood' events.
coastal_lakeshore_floods <- dangerousflood %>%
filter(grepl("Coastal Flood", EVENT_TYPE) | grepl("Lakeshore Flood", EVENT_TYPE))
I would like to explore the relationship between flood occurrences and location.
# Filter for specific flood events using grepl for a pattern match
dangerousflood_filtered <- dangerousflood %>%
filter(grepl("Flood|Lakeshore Flood|Flash Flood|Coastal Flood", EVENT_TYPE)) %>%
filter(between(LATITUDE, 24, 49.3843), between(LONGITUDE, -125, -66.93457)) # Bounds for the continental US
# Get the map data for the US
us_map <- map_data("state")
# Plot the map with flood events
ggplot() +
geom_polygon(data = us_map, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
geom_point(data = dangerousflood_filtered, aes(x = LONGITUDE, y = LATITUDE, color = EVENT_TYPE), size = 1, alpha = 0.8) +
scale_color_viridis_d(guide = guide_legend(override.aes = list(size=4))) + # Use a larger size for legend items
labs(title = "Flood Events in the United States by Type", x = "Longitude", y = "Latitude") +
theme_minimal()
The absence of precise longitude and latitude data for coastal and lakeshore floods in the dataset does present a limitation for detailed geospatial analysis. Nevertheless, it is reasonable to infer that such events predominantly occur along coastlines and lake shores. Despite this data gap, the overall trend suggests a higher incidence of floods in the eastern regions of the United States. This pattern may be influenced by a variety of factors, including regional climate conditions, topography, and the presence of water bodies such as rivers and estuaries that are prone to overflowing.
In contrast, the Western states, aside from coastal areas like California, Oregon, and Washington, report fewer flood events. This could reflect the region’s different climate and landscape characteristics, which include arid areas and mountainous terrains that are less susceptible to flooding compared to the flat and water-rich Eastern landscapes. However, coastal Western states, with their proximity to the Pacific Ocean, do experience their share of flood events, potentially related to Pacific storms and seasonal weather patterns.
# Summarize the data to get the count of flood events per type per state and average coordinates
dangerousflood_summary <- dangerousflood_filtered %>%
group_by(STATE, EVENT_TYPE) %>%
summarize(Count = n(),
AVG_LATITUDE = mean(LATITUDE, na.rm = TRUE),
AVG_LONGITUDE = mean(LONGITUDE, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
# Get the map data for the US
us_map <- map_data("state")
# Define custom colors for the flood types
custom_colors <- c("Flood" = "blue", "Flash Flood" = "orange")
# Plot the map with flood events, varying the size of the dot by the count of events and using custom colors
ggplot() +
geom_polygon(data = us_map, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
geom_point(data = dangerousflood_summary, aes(x = AVG_LONGITUDE, y = AVG_LATITUDE, size = Count, color = EVENT_TYPE), alpha = 0.6) +
scale_size(range = c(1, 6)) + # Adjust the range for the size of the points as needed, making them smaller
scale_color_manual(values = custom_colors) + # Use custom colors
labs(title = "Flood Events in the United States by Type", x = "Longitude", y = "Latitude") +
theme_minimal() +
theme(legend.position = "bottom") # Move legend to bottom for better visibility
# Inspect any warnings from the last dplyr call (an empty list means none)
dplyr::last_dplyr_warnings()
## list()
This state-based visualization of flood events, categorized by type and delineated by state, reveals a pronounced concentration of larger circles (denoting a higher number of floods) in the eastern regions. This correlates with the elevated property damage recorded in these areas. On the other hand, the West Coast, particularly Oregon and Washington, displays a distinct pattern: these states experience a moderate prevalence of flood and flash flood events, yet with comparatively lower property damage (except for California).
Financial losses can also vary with the type of flood, which I investigate next.
# Function to convert damage property values to numeric
convert_damage <- function(damage) {
if (is.na(damage) || damage == "") {
return(NA)
}
# Remove commas and convert to upper case
damage <- toupper(gsub(",", "", damage))
# Convert based on suffixes
if (grepl("K", damage)) {
return(as.numeric(gsub("K", "", damage)) * 1e3)
} else if (grepl("M", damage)) {
return(as.numeric(gsub("M", "", damage)) * 1e6)
} else if (grepl("B", damage)) {
return(as.numeric(gsub("B", "", damage)) * 1e9)
} else {
return(as.numeric(damage))
}
}
# Apply the function to the DAMAGE_PROPERTY column
dangerousflood$DAMAGE_PROPERTY_NUM <- sapply(dangerousflood$DAMAGE_PROPERTY, convert_damage, USE.NAMES = FALSE)
Thus, I create a graph visualization below:
# Summarize the damage by event type, omitting NAs
damage_summary <- dangerousflood %>%
filter(grepl("Flood|Lakeshore Flood|Flash Flood|Coastal Flood", EVENT_TYPE, ignore.case = TRUE)) %>%
filter(!is.na(DAMAGE_PROPERTY_NUM)) %>%
group_by(EVENT_TYPE) %>%
summarize(Total_Damage = sum(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
Average_Damage = mean(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
.groups = 'drop') # Drop grouping structure after summarizing
# Create a bar plot to compare total damage by event type
ggplot(damage_summary, aes(x = reorder(EVENT_TYPE, Total_Damage), y = Total_Damage, fill = EVENT_TYPE)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Total Property Damage by Flood Event Type", x = "Event Type", y = "Total Damage (USD)") +
scale_fill_viridis_d() # Use a viridis color scale for the fill
Flash floods stand out as the most financially destructive, incurring total property damage that exceeds that of general floods (labeled simply 'Flood' in the dataset) by approximately threefold. The scarcity of lakeshore and coastal flood occurrences in the dataset precludes a comprehensive assessment of their average financial toll. However, the available figures suggest that while these flood types may be less frequent, the property damage when they do occur can be significant. Thus, I plotted another graph to show average losses.
# Calculate the average damage by event type, omitting NAs
average_damage_summary <- dangerousflood %>%
filter(EVENT_TYPE %in% c("Flood", "Lakeshore Flood", "Flash Flood", "Coastal Flood")) %>%
filter(!is.na(DAMAGE_PROPERTY_NUM)) %>%
group_by(EVENT_TYPE) %>%
summarize(Average_Damage = mean(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
.groups = 'drop') # Drop grouping structure after summarizing
# Print the average damage summary
print(average_damage_summary)
## # A tibble: 4 × 2
## EVENT_TYPE Average_Damage
## <chr> <dbl>
## 1 Coastal Flood 28090.
## 2 Flash Flood 346041.
## 3 Flood 125967.
## 4 Lakeshore Flood 341298.
# Create a bar plot to compare average damage by event type
ggplot(average_damage_summary, aes(x = reorder(EVENT_TYPE, -Average_Damage), y = Average_Damage, fill = EVENT_TYPE)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Average Property Damage by Flood Event Type", x = "Event Type", y = "Average Damage (USD)") +
scale_fill_viridis_d() # Use a viridis color scale for the fill
The data and plot suggest that lakeshore floods, while less frequent, tend to be more devastating on average when they do occur. This could be attributed to the specific characteristics of lakeshore floods, which may strike areas with high property values or cause more extensive damage by their nature. In contrast, general floods, recorded simply as "Flood", appear to be the least destructive on a per-event basis, which might earn them the moniker of relatively "gentle" floods. Nonetheless, the average property damage per event is substantial across all flood types, underscoring the significant financial burden these natural disasters can impose.
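The frequency claims above can be checked directly with a count of events per type; a minimal sketch, assuming `dangerousflood` and `EVENT_TYPE` as used throughout:

```r
# How often does each flood type occur in the 2020-2021 data?
dangerousflood %>%
  filter(grepl("Flood", EVENT_TYPE, ignore.case = TRUE)) %>%
  count(EVENT_TYPE, sort = TRUE)
```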
# DAMAGE_CROPS uses the same K/M/B suffix encoding as DAMAGE_PROPERTY,
# so the convert_damage function defined above can be reused directly
dangerousflood$DAMAGE_CROPS_NUM <- sapply(dangerousflood$DAMAGE_CROPS, convert_damage, USE.NAMES = FALSE)
# Calculate the average crop damage by event type, omitting NAs
average_crop_damage_summary <- dangerousflood %>%
filter(EVENT_TYPE %in% c("Flood", "Lakeshore Flood", "Flash Flood", "Coastal Flood")) %>%
filter(!is.na(DAMAGE_CROPS_NUM)) %>%
group_by(EVENT_TYPE) %>%
summarize(Average_Crop_Damage = mean(DAMAGE_CROPS_NUM, na.rm = TRUE),
.groups = 'drop') # Drop grouping structure after summarizing
# Print the average crop damage summary
print(average_crop_damage_summary)
## # A tibble: 4 × 2
## EVENT_TYPE Average_Crop_Damage
## <chr> <dbl>
## 1 Coastal Flood 5602.
## 2 Flash Flood 15153.
## 3 Flood 1235.
## 4 Lakeshore Flood 0
# Create a bar plot to compare average crop damage by event type
ggplot(average_crop_damage_summary, aes(x = reorder(EVENT_TYPE, -Average_Crop_Damage), y = Average_Crop_Damage, fill = EVENT_TYPE)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Average Crop Damage by Flood Event Type", x = "Event Type", y = "Average Damage (USD)") +
scale_fill_viridis_d() # Use a viridis color scale for the fill
Flash floods emerge as the predominant culprits for crop losses, underscoring their intense and often localized impact on agricultural areas. General floods also contribute to a considerable extent to the average crop damage, although to a lesser degree than flash floods. In stark contrast, coastal and lakeshore floods are found to have a negligible effect on crop damage. This lack of impact could be due to the geographical locations of such floods, which may not typically overlap with agricultural zones. These findings highlight the variable threats that different types of floods pose to agricultural livelihoods and can inform tailored approaches to agricultural risk management and resilience building.
dangerousflood <- dangerousflood %>%
mutate(Total_Injuries = INJURIES_DIRECT + INJURIES_INDIRECT,
Total_Deaths = DEATHS_DIRECT + DEATHS_INDIRECT)
# Summarize total injuries and deaths by flood type
injuries_deaths_by_type <- dangerousflood %>%
group_by(EVENT_TYPE) %>%
summarise(Total_Injuries = sum(Total_Injuries, na.rm = TRUE),
Total_Deaths = sum(Total_Deaths, na.rm = TRUE)) %>%
ungroup()
# Reshape the data into long format
injuries_deaths_long <- injuries_deaths_by_type %>%
pivot_longer(cols = c(Total_Injuries, Total_Deaths), names_to = "Metric", values_to = "Count")
# Create the plot
ggplot(injuries_deaths_long, aes(x = EVENT_TYPE, y = Count, fill = Metric)) +
geom_bar(stat = 'identity', position = 'dodge') +
scale_fill_manual(values = c("cyan4", "maroon")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "top") + # Move legend to the top for better readability
labs(title = 'Injuries and Deaths by Flood Type',
x = 'Flood Type',
y = 'Count',
fill = 'Metric')
Flash floods can claim lives through their sudden onset and rapid escalation: recorded fatalities exceed 200, while injuries stand at roughly 75. Deaths outnumbering injuries suggests that flash floods are markedly more lethal than other flood types; the same pattern holds for general floods, albeit with a lower overall count of affected individuals. The absence of recorded fatalities for lakeshore and coastal floods could indicate a lower human impact, or it may reflect data limitations. Because flash floods leave people very little time to evacuate, government agencies should give this flood type particular attention in warning systems and response planning.
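The lethality comparison can be made explicit by computing deaths as a share of all casualties per flood type; a sketch using the `Total_Deaths` and `Total_Injuries` columns created above (`pmax(..., 1)` guards against division by zero for types with no casualties):

```r
# Fatality share per flood type: deaths as a fraction of all casualties
dangerousflood %>%
  group_by(EVENT_TYPE) %>%
  summarise(Deaths = sum(Total_Deaths, na.rm = TRUE),
            Injuries = sum(Total_Injuries, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(Death_Share = Deaths / pmax(Deaths + Injuries, 1))
```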
# Aggregate the data to count the number of flood events per state for each year
yearly_state_flood_counts <- dangerousflood %>%
count(STATE, YEAR) %>%
mutate(STATE = tolower(STATE)) # Ensure state names match map data
# Get US map data
states_map <- map_data("state")
# Merge map data with yearly flood counts
# (left_join preserves the row order of states_map; base merge() re-sorts
# rows by the key, which scrambles the polygon drawing order on the map)
map_data_merged <- left_join(states_map, yearly_state_flood_counts, by = c("region" = "STATE"))
# Plot the map with facets for each year
ggplot(map_data_merged, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = n), color = "white") +
scale_fill_gradient(low = "lightblue", high = "darkred", na.value = "grey50", name = "Flood Counts") +
facet_wrap(~YEAR) + # Facets for each year
labs(title = "Yearly Flood Event Counts by State") +
theme_void() +
theme(strip.text.x = element_text(size = 8), # Make facet labels smaller
legend.position = "bottom")
The plot of flood occurrences from 2020 to 2021 reveals a marked
escalation in certain states, most notably Texas and New York. This
upward trend in flood events signals a critical need for heightened
awareness and proactive measures from local and federal government
agencies.
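The states with the largest year-over-year increases can be ranked directly; a sketch assuming `YEAR` holds the numeric values 2020 and 2021 (the `names_prefix` turns them into valid column names `y2020` and `y2021`):

```r
# Rank states by the change in flood event counts from 2020 to 2021
yearly_state_flood_counts %>%
  pivot_wider(names_from = YEAR, values_from = n,
              values_fill = 0, names_prefix = "y") %>%
  mutate(change = y2021 - y2020) %>%
  arrange(desc(change)) %>%
  head(10)
```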
In reviewing the comprehensive NOAA dataset, it becomes evident that floods pose a significant risk, often resulting in fatalities rather than just injuries. The distribution of flood occurrences varies notably across different regions of the United States, with some states experiencing an increase in flood events in 2021 compared to the previous year. These incidents have led to a substantial number of deaths and injuries, underscoring the importance of effective disaster preparedness and response strategies. In particular, flash floods emerge as exceptionally perilous, causing considerable harm to individuals and communities. Furthermore, lakeshore floods, while less frequent, have the potential to inflict substantial property damage. Coastal areas, due to their relatively high incidence of flooding and associated economic losses, warrant special attention for mitigation and safety measures. Collectively, these insights highlight the critical need for heightened flood awareness and preparedness to safeguard lives and property against this natural disaster.